28 research outputs found

How to handle gender and number agreement in statistical language models

Author: Caroline Lavecchia
Jean-Paul Haton
Kamel Smaïli
Publication date: 01/01/2006

Abstract: Agreement in gender and number is a critical problem in statistical language modeling. One of the main difficulties in French speech recognition is the presence of misrecognized words caused by incorrect agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle such agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and number of the word to predict. It is a dynamic, variable-length cache whose size is determined automatically according to syntagm delimiters. The main advantage of this model is that it needs no syntactic parsing: it is used like any other statistical language model. Several models were tested, and the best one improves perplexity by approximately 9 points. The model has been integrated into a speech recognition system based on the JULIUS engine. Tests were carried out on 280 sentences provided by AUPELF for the French automatic speech recognition evaluation campaign. The new model outperforms the baseline by 3% in terms of word error.

CiteSeerX

Building a bilingual dictionary from movie subtitles based on inter-lingual triggers

Author: Langlois David
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 29/11/2007

This paper focuses on two aspects of machine translation: parallel corpora and the translation model. First, we present a method to build parallel corpora automatically from subtitle files gathered from the Internet. This yields useful data for subtitle machine translation. Our method is based on Dynamic Time Warping. We evaluated this alignment method by comparing it with a hand-aligned sample and obtained an alignment precision of 0.92. Second, we use the notion of inter-lingual triggers to build multilingual dictionaries and translation tables for machine translation from the subtitle parallel corpora. Inter-lingual triggers make it possible to detect pairs of source and target words in parallel corpora. The Mutual Information measure used to determine inter-lingual triggers supports the hypothesis that a word in the source language is a translation of a word in the target language. We evaluate the resulting dictionary by comparing it with two existing dictionaries. We then integrated the resulting translation tables into a complete translation decoding process supplied by Pharaoh, and compared the translation performance of our tables with that obtained using the Giza++ tool. The results showed that the system tuned for our tables improves the BLEU score by 2.2% compared with the one obtained with Giza++.
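As an illustration of how Mutual Information can pair source and target words, here is a minimal sketch. It is not the authors' implementation: the function name `interlingual_triggers`, the sentence-level co-occurrence counting, and the score P(s,t) · log(P(s,t) / (P(s) · P(t))) are assumptions chosen for the example.

```python
import math
from collections import Counter
from itertools import product


def interlingual_triggers(pairs, top_k=1):
    """Rank target words for each source word by a mutual-information score.

    pairs: list of (source_tokens, target_tokens) for aligned sentences.
    Hypothetical helper; counting and scoring choices are assumptions.
    """
    n = len(pairs)
    src_count, tgt_count, joint_count = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_set, t_set = set(src), set(tgt)  # count each word once per sentence
        src_count.update(s_set)
        tgt_count.update(t_set)
        joint_count.update(product(s_set, t_set))

    scored = {}
    for (s, t), c in joint_count.items():
        p_joint = c / n
        p_s, p_t = src_count[s] / n, tgt_count[t] / n
        mi = p_joint * math.log(p_joint / (p_s * p_t))
        scored.setdefault(s, []).append((mi, t))

    # Keep the top_k target words per source word (ties broken alphabetically).
    return {
        s: [t for _, t in sorted(cands, key=lambda x: (-x[0], x[1]))[:top_k]]
        for s, cands in scored.items()
    }


corpus = [
    (["le", "chat"], ["the", "cat"]),
    (["le", "chien"], ["the", "dog"]),
    (["un", "chat"], ["a", "cat"]),
    (["un", "chien"], ["a", "dog"]),
]
triggers = interlingual_triggers(corpus)
print(triggers["chat"], triggers["chien"])
```

On this toy corpus, words that always co-occur across the aligned pairs receive the highest score, which is the intuition behind using the triggers as dictionary entries.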

INRIA a CCSD electronic archive server

Discovering Phrases in Machine Translation by Simulated Annealing

Author: Langlois David
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 22/09/2008

In this paper, we propose a new phrase-based translation model based on inter-lingual triggers. The originality of our method is twofold. First, we identify common source phrases. Then we use inter-lingual triggers to retrieve their translations. Furthermore, we treat the extraction of phrase translations as an optimization problem: we use a simulated annealing algorithm to find the best phrase translations among all those determined by inter-lingual triggers. The best phrases are those that improve translation quality in terms of BLEU score. Tests are carried out on the proceedings of the European Parliament. Training uses a corpus of 596K parallel sentences (French-English) and testing uses a corpus of 1444 sentences. With only 8.1% of the identified source phrases occurring in the test corpus, our system outperforms the baseline model by almost 3 points.

INRIA a CCSD electronic archive server

Building Parallel Corpora from Movies

Author: Langlois David
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 12/06/2007

This paper proposes using DTW to construct parallel corpora from difficult data. Parallel corpora are the raw material for machine translation (MT); MT systems frequently use European or Canadian parliament corpora. To build a more realistic machine translation system, we decided to use movie subtitles. These data can be considered difficult because they contain unfamiliar expressions, abbreviations, hesitations, words that do not appear in classical dictionaries (such as vulgar words), etc. The resulting parallel corpora can be a rich resource for training spontaneous speech translation systems. From 40 movies, we align 43013 English subtitles with 42306 French subtitles. This yields 37625 aligned pairs with a precision of 92.3%.
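To make the DTW step concrete, here is a minimal sketch of dynamic time warping over two subtitle sequences. The local cost used here (absolute difference between subtitle start times, in seconds) and the function name `dtw_align` are assumptions for illustration; the abstract does not say which matching score the authors use.

```python
def dtw_align(src_times, tgt_times):
    """Align two sequences with classic DTW and return (total_cost, path).

    src_times, tgt_times: subtitle start times in seconds (assumed input).
    path is a list of (source_index, target_index) pairs.
    """
    n, m = len(src_times), len(tgt_times)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Fill the cumulative cost matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(src_times[i - 1] - tgt_times[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, preferring the diagonal move on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


english = [1.0, 4.0, 9.0]
french = [1.2, 4.1, 6.0, 9.3]
total, path = dtw_align(english, french)
print(total, path)
```

In this toy case the second English subtitle is matched to two French subtitles, which mirrors the situation where the two languages split the same dialogue differently.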

INRIA a CCSD electronic archive server

An alternative to IBM's statistical translation models: inter-lingual triggers (Une alternative aux modèles de traduction statistique d'IBM : Les triggers inter-langues)

Author: Langlois David
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 09/06/2008

In this article, we present a new approach to machine translation based on inter-lingual triggers. We first explain the concept of inter-lingual triggers and how they are determined. We then present the experiments carried out with these triggers in order to integrate them as effectively as possible into a complete machine translation process. To this end, we build translation tables from the inter-lingual triggers using several methods. We then compare our translation system based on inter-lingual triggers with a state-of-the-art system based on IBM Model 3 (Brown93). The tests showed that the translations generated by our system improve the BLEU score (Papineni01) by 2.4% compared with those produced by the state-of-the-art system.

INRIA a CCSD electronic archive server

Linguistic features modeling based on Partial New Cache

Author: Haton Jean-Paul
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 01/05/2006

Agreement in gender and number is a critical problem in statistical language modeling. One of the main problems in French speech recognition is the presence of misrecognized words caused by incorrect agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle such agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and number of the word to predict. It is a dynamic, variable-length cache whose size is determined according to syntagm delimiters. The model needs no syntactic parsing: it is used like any other statistical language model. Several models were tested, and the best one improves perplexity by more than 8 points.

INRIA a CCSD electronic archive server

How to handle gender and number agreement in statistical language models?

Author: Haton Jean-Paul
Lavecchia Caroline
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 17/09/2006

Agreement in gender and number is a critical problem in statistical language modeling. One of the main difficulties in French speech recognition is the presence of misrecognized words caused by incorrect agreement (in gender and number) between words. Statistical language models do not treat this phenomenon directly. This paper focuses on how to handle such agreement. We introduce an original model called Features-Cache (FC) to estimate the gender and number of the word to predict. It is a dynamic, variable-length cache whose size is determined automatically according to syntagm delimiters. The main advantage of this model is that it needs no syntactic parsing: it is used like any other statistical language model. Several models were tested, and the best one improves perplexity by approximately 9 points. The model has been integrated into a speech recognition system based on the JULIUS engine. Tests were carried out on 280 sentences provided by AUPELF for the French automatic speech recognition evaluation campaign. The new model outperforms the baseline by 3% in terms of word error.

INRIA a CCSD electronic archive server

New Confidence Measures for Statistical Machine Translation

Author: Langlois David
Lavecchia Caroline
Raybaud Sylvain
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 19/01/2009

A confidence measure estimates the reliability of a hypothesis provided by a machine translation system. The problem can be seen as a testing process: we want to decide whether the most probable sequence of words provided by the machine translation system is correct or not. We describe several original word-level confidence measures for machine translation, based on mutual information, an n-gram language model and a lexical-features language model. We evaluate how well they perform individually and together, and show that a combination of confidence measures based on mutual information yields a classification error rate as low as 25.1% with an F-measure of 0.708.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Word- and sentence-level confidence measures for machine translation

Author: Langlois David
Lavecchia Caroline
Raybaud Sylvain
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 14/05/2009

A machine-translated sentence is seldom completely correct. Confidence measures are designed to detect incorrect words, phrases or sentences, or to estimate the probability of correctness. In this article we describe several word- and sentence-level confidence measures relying on different features: mutual information between words, n-gram and backward n-gram language models, and linguistic features. We also try different combinations of these measures. Their accuracy is evaluated on a classification task. We achieve a 17% error rate (0.84 F-measure) at word level and a 31% error rate (0.71 F-measure) at sentence level.
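For readers unfamiliar with the two figures reported above, the following sketch shows how a classification error rate and an F-measure can be computed from binary correct/incorrect decisions. The function name `classification_metrics` and the choice of "correct" (label 1) as the positive class are assumptions; the abstract does not state which class the authors treat as positive.

```python
def classification_metrics(labels, predictions):
    """Return (error_rate, precision, recall, f_measure) for binary decisions.

    labels, predictions: sequences of 1 (correct) or 0 (incorrect).
    Treating label 1 as the positive class is an assumption of this sketch.
    """
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)

    error_rate = (fp + fn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return error_rate, precision, recall, f_measure


# Toy data: reference judgments and the decisions of a confidence classifier.
reference = [1, 1, 1, 0, 0, 1, 0, 1]
decisions = [1, 1, 0, 0, 1, 1, 0, 1]
error_rate, precision, recall, f_measure = classification_metrics(reference, decisions)
print(error_rate, precision, recall, f_measure)
```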

INRIA a CCSD electronic archive server

Using inter-lingual triggers for Machine translation

Author: Haton Jean-Paul
Langlois David
Lavecchia Caroline
Smaïli Kamel
Publication venue: 'The International Fiscal Association of Korea'
Publication date: 27/08/2007

In this paper, we present the idea of cross-lingual triggers and exploit this formalism to build a bilingual dictionary for machine translation. We describe how cross-lingual triggers can be used to produce such a dictionary. We then compare it with an ELRA dictionary and a freely downloadable dictionary. Finally, our dictionary is evaluated against the one produced by GIZA++ (an extension of the program GIZA) within a complete translation decoding process supplied by Pharaoh. The experiments showed that the resulting dictionary is well constructed and suitable for machine translation. They were conducted on a parallel corpus of 19 million French words and 17 million English words. These encouraging results lead us to put forward the concept of cross-lingual triggers, which could have many applications in machine translation.

INRIA a CCSD electronic archive server